{
 "cells": [
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "",
   "id": "17e8c47647a3eece"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "LangChain implements a **Document** abstraction, which is intended to represent a unit of text and associated metadata. It has three attributes:\n",
    "\n",
    "* page_content: a string representing the content;\n",
    "* metadata: a dict containing arbitrary metadata;\n",
    "* id: (optional) a string identifier for the document."
   ],
   "id": "5eb02f8defe3461a"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "from langchain_core.documents import Document\n",
    "\n",
    "documents = [\n",
    "    Document(\n",
    "        page_content=\"Dogs are great companions, known for their loyalty and friendliness.\",\n",
    "        metadata={\"source\": \"mammal-pets-doc\"},\n",
    "    ),\n",
    "    Document(\n",
    "        page_content=\"Cats are independent pets that often enjoy their own space.\",\n",
    "        metadata={\"source\": \"mammal-pets-doc\"},\n",
    "    ),\n",
    "]"
   ],
   "id": "a14bf8131cb573c4"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "use a simple text splitter that partitions based on characters. We will split our documents into chunks of 1000 characters with 200 characters of overlap between chunks.\n",
    "\n",
    "We set add_start_index=True so that the character index where each split Document starts within the initial Document is preserved as metadata attribute “start_index”. "
   ],
   "id": "a16391cd9f80f5b8"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
    "\n",
    "text_splitter = RecursiveCharacterTextSplitter(\n",
    "    chunk_size=1000, chunk_overlap=200, add_start_index=True\n",
    ")\n",
    "all_splits = text_splitter.split_documents(docs)\n",
    "\n",
    "len(all_splits)"
   ],
   "id": "cdc4f057c164f5fc"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "from langchain_core.vectorstores import InMemoryVectorStore\n",
    "\n",
    "vector_store = InMemoryVectorStore(embeddings)"
   ],
   "id": "84d1806744627637"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "import faiss\n",
    "from langchain_community.docstore.in_memory import InMemoryDocstore\n",
    "from langchain_community.vectorstores import FAISS\n",
    "\n",
    "embedding_dim = len(embeddings.embed_query(\"hello world\"))\n",
    "index = faiss.IndexFlatL2(embedding_dim)\n",
    "\n",
    "vector_store = FAISS(\n",
    "    embedding_function=embeddings,\n",
    "    index=index,\n",
    "    docstore=InMemoryDocstore(),\n",
    "    index_to_docstore_id={},\n",
    ")"
   ],
   "id": "a87738f054fedf29"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": "ids = vector_store.add_documents(documents=all_splits)",
   "id": "15da5abef92d3e05"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "results = vector_store.similarity_search(\n",
    "    \"How many distribution centers does Nike have in the US?\"\n",
    ")\n",
    "\n",
    "print(results[0])"
   ],
   "id": "e801c1d20a6c8d55"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "# Note that providers implement different scores; the score here\n",
    "# is a distance metric that varies inversely with similarity.\n",
    "\n",
    "results = vector_store.similarity_search_with_score(\"What was Nike's revenue in 2023?\")\n",
    "doc, score = results[0]\n",
    "print(f\"Score: {score}\\n\")\n",
    "print(doc)"
   ],
   "id": "3b72386ebd723785"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "embedding = embeddings.embed_query(\"How were Nike's margins impacted in 2023?\")\n",
    "\n",
    "results = vector_store.similarity_search_by_vector(embedding)\n",
    "print(results[0])"
   ],
   "id": "2806e88e6f18bb5"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "from typing import List\n",
    "\n",
    "from langchain_core.documents import Document\n",
    "from langchain_core.runnables import chain\n",
    "\n",
    "\n",
    "@chain\n",
    "def retriever(query: str) -> List[Document]:\n",
    "    return vector_store.similarity_search(query, k=1)\n",
    "\n",
    "\n",
    "retriever.batch(\n",
    "    [\n",
    "        \"How many distribution centers does Nike have in the US?\",\n",
    "        \"When was Nike incorporated?\",\n",
    "    ],\n",
    ")"
   ],
   "id": "45243f0be628787"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "retriever = vector_store.as_retriever(\n",
    "    search_type=\"similarity\",\n",
    "    search_kwargs={\"k\": 1},\n",
    ")\n",
    "\n",
    "retriever.batch(\n",
    "    [\n",
    "        \"How many distribution centers does Nike have in the US?\",\n",
    "        \"When was Nike incorporated?\",\n",
    "    ],\n",
    ")"
   ],
   "id": "b1a9e7ca9bb7c3e8"
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
